ABSTRACT

Similarity search methods are widely used as kernels in var- 
ious data mining and machine learning applications includ- 
ing those in computational biology, web search/clustering. 
Nearest neighbor search (NNS) algorithms are often used 
to retrieve similar entries, given a query. While there exist 
efficient techniques for exact query lookup using hashing, 
similarity search using exact nearest neighbors suffers from 
a "curse of dimensionality", i.e. for high dimensional spaces, 
best known solutions offer little improvement over brute 
force search and thus are unsuitable for large scale streaming 
applications. Fast solutions to the approximate NNS prob- 
lem include Locality Sensitive Hashing (LSH) based tech- 
niques, which need storage polynomial in n with exponent 
greater than 1, and query time sublinear, but still polyno- 
mial in n, where n is the size of the database. In this work 
we present a new technique of solving the approximate NNS 
problem in Euclidean space using a Ternary Content Ad- 
dressable Memory (TCAM), which needs near linear space 
and has O(1) query time. In fact, this method also works
around the best known lower bounds in the cell probe model 
for the query time using a data structure near linear in the 
size of the database.

TCAMs are high performance associative memories widely 
used in networking applications such as address lookups and 
access control lists. A TCAM can query for a bit vector 
within a database of ternary vectors, where every bit position
represents 0, 1 or *. The * is a wild card representing
either a 0 or a 1. We leverage TCAMs to design a variant
of LSH, called Ternary Locality Sensitive Hashing (TLSH) 
wherein we hash database entries represented by vectors in 
the Euclidean space into {0, 1, *}. By using the added func- 
tionality of a TLSH scheme with respect to the * character, 
we solve an instance of the approximate nearest neighbor 
problem with 1 TCAM access and storage nearly linear in 
the size of the database. We validate our claims with exten- 
sive simulations using both real world (Wikipedia) as well as 
synthetic (but illustrative) datasets. We observe that using 
a TCAM of width 288 bits, it is possible to solve the approx- 
imate NNS problem on a database of size 1 million points 
with high accuracy. Finally, we design an experiment with 
TCAMs within an enterprise ethernet switch (Cisco Cata- 
lyst 4500) to validate that TLSH can be used to perform 
1.5 million queries per second per 1 Gb/s port. We believe
that this work can open new avenues in very high speed data 
mining. 
1. INTRODUCTION

Due to the explosion in the size of datasets and the increased
availability of high speed data streams, it has become
necessary to speed up similarity search (SS), i.e. to
look for objects within a database similar to a query ob- 
ject, which is a critical component of most data mining and 
machine learning tasks. For example, consider searching for 
similar images within a corpus of billions of images and re- 
peating this for a query set consisting of millions of images 
using as little power and computation time as possible. One 
could use the streaming model and stream the corpus over
the latter set. In order to do this, one would typically deploy
very fast computing devices or distribute it over several compute
devices. In this paper, we show how this goal can be
achieved with just an associative memory module, ternary
content addressable memory (TCAM) [31], commonly used
in networking for route lookups and access control list (ACL) 
filtering, to perform a specific variant of SS, i.e. determine 
the approximate nearest neighbor for the L2 or Euclidean 
space. 

Common tasks in mining and learning depend heavily on 
SS. For example, clustering algorithms are designed to max- 
imize intra cluster similarity and minimize inter cluster sim- 
ilarity. In classification, the label of a new query object is 
determined based on its similarity to trained (labeled) data 
and their labels. In several applications of SS such as in con- 
tent based search, pattern recognition and computational bi- 
ology, objects are represented by a large number of features 
in a high dimensional (metric) space, and SS is typically im- 
plemented using nearest neighbor search routines. Given a 
set consisting of n points, the nearest neighbor search problem
[29] is to build a data structure which, given a query point,
reports the data point nearest to the query. For example,
nearest neighbor methods and their variants have been used
for classification purposes [15], stream classification [2] and
clustering heuristics [9]. Applications of SS range from content
search and lazy classifiers to genomics, proteomics, image
search, and NLP.

Existing solutions to the exact nearest neighbor problem 
offer little improvement over brute force linear search. The 
best known solutions to exact nearest neighbor include those 
which use space partitioning techniques like k-d trees [8],
cover trees [10], and navigating nets [23]. However, these
techniques do not scale well with the number of dimensions.
An experimental study [35] indicates that when the number of
dimensions exceeds 10, space partitioning techniques are
slower than a brute force linear scan.

A class of solutions that have been shown to scale well are those
that are based on locality sensitive hashing (LSH) [4] which
solve the approximate nearest neighbor problem. The c- 
Approximate Nearest Neighbor problem (c-ANNS) allows 
the solutions to return a point whose distance to the query 
is at most c times the distance from the query to its nearest 
neighbor. A family of hash functions is said to be locality- 
sensitive if it hashes nearby points to the same bin with high 
probability and hashes far-off points to the same bin with 
low probability. To solve the approximate nearest neighbor 
problem on a set of n points in d dimensional Euclidean 
space, the data points are hashed to a number of buckets 
using locality-sensitive hash functions in the pre-processing 
step. To perform a similarity search, the query is hashed us- 
ing the same hash functions and the similarity search is per- 
formed on the data points retrieved from the corresponding 
buckets. In the last few years LSH has been extensively used 
for SS in diverse applications including bioinformatics [11, 17],
kernelized LSH in computer vision [24], clustering [20], and
time series analysis [3T]. For the Euclidean space, the optimal
LSH based algorithm which solves the c-ANNS problem
has a space requirement of $O(n^{1+1/c^2})$ and a query time
of $O(n^{1/c^2})$. For $c \approx 1$, this near quadratic space
requirement of LSH and its query time, which is sub-linear (but still
polynomial) in n, make it difficult to use LSH in streaming applications,
especially at extremely high speeds, which are beyond the
capability of a CPU. In such a scenario, we look for hardware
primitives to accelerate c-ANNS.

Figure 1: A comparison of LSH and TLSH: We
choose a random direction $u$ and consider a family
of hyperplanes orthogonal to it, adjacent hyperplanes
being separated by $\delta$. The LSH family hashes
regions between the hyperplanes to 0, 1, 0, 1, ..., while
the TLSH family hashes the regions between the
hyperplanes to 0, *, 1, *, 0, *, ...

In this paper, we develop a variant of LSH, Ternary Locality
Sensitive Hashing (TLSH), for solving the nearest neighbor
problem in high dimensions using TCAMs, and we show
that it is possible to formulate an almost 'ideal' solution to
the c-ANNS problem with a space requirement near linear
in the size of the database and a query time of O(1). A
TCAM is an associative memory where each data "bit" is
capable of storing one of three states: 0, 1, or *, which we denote
as ternions, where * is a wildcard that matches both 0 and
1 [31]. Thus a TCAM can be considered to be a memory
of n vectors, each w ternions wide. The presence of wildcards
in TCAM entries implies that more than one entry could
match a search key. When this happens, the index of the
highest-priority matching entry (i.e. the one appearing at the
lowest physical address) is typically returned. Access speeds for TCAMs
are comparable to the fastest, most expensive RAMs. For
almost a decade, TCAMs have been used in switches and
routers, primarily for the purposes of route lookup (longest
prefix matching) [28, 37] and packet classification [28, 25].
In this paper, we present an application in which the c-ANNS
problem can be solved with a single lookup in a TCAM
of width poly(log n), where n is the size of the
database, by using the TLSH family.
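
As a minimal software model of the TCAM behaviour just described (ternion matching plus lowest-address priority), the following Python sketch may help. The function names and the string encoding of entries are assumptions made for illustration; real hardware compares all entries in parallel, which this loop does not reproduce. The search key is allowed to contain wildcards as well, since TLSH signatures of queries can contain *.

```python
from typing import List, Optional

WILDCARD = "*"

def ternion_match(a: str, b: str) -> bool:
    """Two ternions match if they are equal or either one is the wildcard."""
    return a == b or a == WILDCARD or b == WILDCARD

def word_match(entry: str, key: str) -> bool:
    """A stored word matches a search key if every position matches."""
    return len(entry) == len(key) and all(
        ternion_match(e, k) for e, k in zip(entry, key)
    )

def tcam_lookup(table: List[str], key: str) -> Optional[int]:
    """Return the lowest address whose entry matches the key, else None.

    Hardware performs all comparisons in parallel; this sequential loop
    only reproduces the returned index, not the constant-time behaviour.
    """
    for address, entry in enumerate(table):
        if word_match(entry, key):
            return address
    return None

table = ["10*1", "1**1", "0000"]
print(tcam_lookup(table, "1011"))  # entries 0 and 1 both match; address 0 wins
print(tcam_lookup(table, "1101"))  # only entry 1 matches
print(tcam_lookup(table, "0110"))  # no entry matches
```
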

For TLSH, we use ternary hash functions that hash any
point in $\mathbb{R}^d$ to the set {0, 1, *}. Analogous to LSH, TLSH has the
property that nearby points are hashed to matching ternions
with high probability. We obtain a TLSH family by partitioning
$\mathbb{R}^d$ using randomly oriented, randomly translated
parallel hyperplanes. Alternate regions between the hyperplanes
are hashed to a *, while the remaining regions are
hashed to 0, 1, 0, 1, ... alternately.

In order to compare the TLSH family to the LSH family,
we consider an example in which we choose a random
direction $u$ (say) and consider a family of hyperplanes
orthogonal to it with adjacent hyperplanes separated by $\delta$
(say). The family of hyperplanes partitions $\mathbb{R}^d$ as shown
in Figure 1. We consider an LSH family which hashes the
regions between the hyperplanes to 0, 1, 0, 1, ... and a TLSH
family which hashes the regions between the hyperplanes to
0, *, 1, *, 0, *, ... as shown in Figure 1. Note that both LSH
and TLSH project points in $\mathbb{R}^d$ onto the random direction
$u$. Consider any two points $s, q \in \mathbb{R}^d$ (as shown in Figure
1), which are a distance $cl$ apart and whose projections on
$u$ are separated by $t$ (say), and another point $q' \in \mathbb{R}^d$ at
a distance of $l$ from $s$, whose projection on $u$ is separated
from that of $s$ by $t/c$ (say). Ideally, locality-sensitive
hashing aims at achieving the twin objectives of
separating far-off points (hashing them to opposite bits) and
hashing nearby points to matching bits with high probability.
However, in this example we see that using a "binary"
hash function from the LSH family, if the probability of
separating $s$ and $q$ (hashing them to opposite bits) is $\psi(t)$ (say),
then informally, the probability of separating $s$ and $q'$ is
at least $\psi(t)/c$. Thus the "binary" hash function does not
achieve both objectives simultaneously. On the other hand,
if we set the distance between the hyperplanes of the TLSH
family to a value greater than $t/c$, no function from the TLSH
family will separate $s$ and $q'$. This is because any choice
of translated hyperplanes will ensure that one of the following
always happens:

1. Either s and q' are both hashed to 0 or both to 1.

2. One of s and q' is hashed to a *.

Thus the ternary hash representations of s and q' always
match. In this manner, the regions hashed to a * "fuzz" the
boundaries between regions hashed to 0s and 1s such that the
ternary hashed representations of nearby points match with
high probability.
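
The one-dimensional picture above can be checked numerically. The sketch below is an illustration under assumed values ($\delta = 1$, $t = 1.2$, $c = 2$, all chosen here, not taken from the paper): it labels projections with the binary pattern 0, 1, 0, 1, ... and with the ternary pattern 0, *, 1, *, ..., and estimates how often two projections a given gap apart are "separated" (hashed to opposite bits) over random shifts of the hyperplanes.

```python
import math
import random

def binary_label(p: float, delta: float) -> str:
    """Slabs of width delta labelled 0, 1, 0, 1, ..."""
    return "01"[math.floor(p / delta) % 2]

def ternary_label(p: float, delta: float) -> str:
    """Slabs of width delta labelled 0, *, 1, *, ..."""
    return "0*1*"[math.floor(p / delta) % 4]

def separated(a: str, b: str) -> bool:
    """Labels are separated when they differ and neither is a wildcard."""
    return a != b and "*" not in (a, b)

def separation_rate(label, gap: float, delta: float, trials: int, rng) -> float:
    """Fraction of random shifts for which two projections `gap` apart are separated."""
    hits = 0
    for _ in range(trials):
        shift = rng.uniform(0.0, 4.0 * delta)
        hits += separated(label(shift, delta), label(shift + gap, delta))
    return hits / trials

rng = random.Random(0)
delta, t, c = 1.0, 1.2, 2.0   # illustrative values only
for name, label in (("binary", binary_label), ("ternary", ternary_label)):
    far = separation_rate(label, t, delta, 20000, rng)
    near = separation_rate(label, t / c, delta, 20000, rng)
    print(f"{name}: far gap {t} -> {far:.3f}, near gap {t / c} -> {near:.3f}")
```

With the ternary labelling, the near pair (gap $t/c < \delta$) is never separated, because a 0 region and a 1 region are always at least $\delta$ apart; the binary labelling separates the near pair a sizeable fraction of the time.
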

We leverage the ability of a TCAM to represent the wildcard
character (*) in order to implement TLSH and store
the ternary hash signatures generated by it. Also, using the
property of the TCAM of returning the highest-priority matching
entry, it is possible to configure this TCAM so that it solves
a sequence of (1, c)-Near Neighbor problems, which are
the decision version of the c-ANNS problem, thus leading to
a solution of the c-ANNS problem itself (details in Section
4.2). Hence, using the TLSH family of hash functions along
with a TCAM of width poly(log n), where n is the size of
the database, the c-ANNS problem can be solved in a single
TCAM lookup. We believe this observation
is very promising with regard to solving similarity
search problems in streaming environments. Also, we note
that Tao et al. [43] describe a novel method to solve the
c-ANNS problem without solving a sequence of near neighbor
problems. Their method involves computation of longest
common prefixes of binary strings. It would be interesting to
know if their methods can be adapted for use with TCAMs
in order to avoid solving a sequence of (1, c)-Near Neighbor
problems.

We note that this method circumvents the lower bounds for
c-ANNS in the cell probe model, according to which any data
structure nearly linear in n needs $\Omega(\log n/\log\log n)$ probes
into the database in order to answer approximate nearest
neighbor queries accurately [33]. This is possible because the TCAM
implicitly implements highly parallel operations which do not
conform to the cell probe model of computation.

We also present simulations which explore the space of 
design parameters and establish the trade-off involved be- 
tween the size of the TCAM used and the performance of 
our algorithm. We use a combination of real world and 
artificially generated data sets each containing one million
points in a 64 dimensional Euclidean space. The first data
set contains randomly generated points (from a suitably cho- 
sen localized region), the second one contains simHash sig- 
natures of web pages belonging to the English Wikipedia 
(from a snapshot of the English Wikipedia in 2005), and the 
third one is again artificially generated in order to maximize 
the number of false positives and false negatives, by having 
many data points on the threshold of being similar or dis- 
similar to a query. From our simulations we observe that a 
TCAM of width 288 bits solves the decision version of the
2-Approximate Nearest Neighbor problem accurately for the
aforementioned databases.

In order to validate our simulations, we design a novel ex- 
periment using TCAMs within a CISCO Catalyst 4500 Eth- 
ernet switch and high speed traffic generators. We demon- 
strate how one can process approximately 1.5M approximate 
nearest neighbor queries per second for each port. Thus, 
it is technically feasible to build devices with TCAMs that 
could serve as high speed similarity engines, in a vein similar 
to using CPUs to accelerate certain classes of application. 
Note that in our case, a TCAM is much more suitable due to 
the combined implicit memory access (lookup) and wildcard 
search done in parallel. 

1.1 Organization 

In Section 2 we define the c-Approximate Nearest
Neighbor problem, the (l, c)-Near Neighbor problem, and
an $(l, u, p_l, p_u)$-TLSH family. In Section 3 we describe the
construction and analysis of a $(1, c, p_1(\delta), p_2(\delta/c))$-TLSH
family for any $\delta > 0$. Section 4 describes the use of a
$(1, c, p_1(\delta), p_2(\delta/c))$-TLSH family in solving the (1, c)-Near
Neighbor problem, the (1, c)-Similarity Search problem and the
c-Approximate Nearest Neighbor problem. Section 5 describes
simulations using a combination of real life and synthetic
data sets containing a million points in 64 dimensional space,
which explore the trade-off between the width (size) of the
TCAM and the performance of the method, along with
experiments which validate our results. Section 6 summarizes
the related work. We summarize the findings of this paper
in Section 7.

2. PRELIMINARIES 

First we define the c-Approximate Nearest Neighbor
Search problem.

Definition 1. c-Approximate Nearest Neighbor Search or
the c-ANNS problem:

Given a set $S$ of $n$ points in $\mathbb{R}^d$, construct a data structure
which, given a query $q \in \mathbb{R}^d$, returns a point $s \in S$ whose
distance from $q$ is at most $c$ times the distance between $q$
and the nearest neighbor of $q$ in $S$.

Next, we define the TCAM match operation "$=_T$" which
declares that two sides match if both are equal or one of
them is a *.

Definition 2. If $A, B \in \{0, 1, *\}$, then $A =_T B$ if and only
if $A = B$ or $A = *$ or $B = *$. The complementary relation
is referred to as $\neq_T$.

Definition 3. (l, c)-Near Neighbor problem or the (l, c)-NN
problem:

Given a set $S$ of $n$ points in $\mathbb{R}^d$, construct a data structure
which, given a query point $q \in \mathbb{R}^d$, if there exists a point
$s_l \in S$ such that $\|s_l - q\|_2 \le l$ then reports "Yes" and a
point $s$ such that $\|s - q\|_2 \le cl$, and if there exists no point
$s_u$ such that $\|s_u - q\|_2 \le cl$ then reports "No".

Note that we can scale down all the coordinates of points by
$l$, in which case the above problem needs to be solved only
for $l = 1$. Accordingly we discuss the solution of the (1, c)-NN
problem in Section 4. Also, note that the (l, c)-NN problem
is the decision version of the c-ANNS problem. The c-ANNS
problem can be reduced to logarithmically many instances of
(1, c)-NN problems. Next, analogous to [22], we define a ternary
locality sensitive hashing family.

Definition 4. Ternary locality sensitive hashing family
(TLSH):

A distribution on a family $\mathcal{G}$ of ternary hash functions
(i.e. functions which map $\mathbb{R}^d \to \{0, 1, *\}$) is said to be an
$(l, u, p_l, p_u)$-TLSH if $\forall x, y \in \mathbb{R}^d$

if $\|x - y\|_2 \le l$ then $\Pr_{\mathcal{G}}[g(x) =_T g(y)] \ge p_l$,
if $\|x - y\|_2 \ge u$ then $\Pr_{\mathcal{G}}[g(x) =_T g(y)] \le p_u$,

where $g$ is drawn from the distribution $\mathcal{G}$.

3. DESIGN AND ANALYSIS OF A TLSH 
FAMILY 

In this section we describe the construction and analysis
of a TLSH family. We will show its application in solving the
(1, c)-NN problem in Section 4. Let $\delta > 0$ be any constant.
Next, we describe the construction of a family of ternary
hash functions $\mathcal{G}_\delta = \{g_\delta : \mathbb{R}^d \to \{0, 1, *\}\}$.

Each hash function in the family $\mathcal{G}_\delta$ is indexed by a random
choice of $a$ and $b$, where $a \in \mathbb{R}^d$ has individual components
$a_i$, $i = 1 \ldots d$, chosen independently from the normal
distribution $\mathcal{N}(0, 1)$ (here $\mathcal{N}(\mu, \sigma)$ denotes a normal
distribution of mean $\mu$ and variance $\sigma^2$), and $b$ is a real number
chosen uniformly from $(0, 2\delta)$. We represent each hash function
in the family $\mathcal{G}_\delta$ as $g_{\delta,a,b} : \mathbb{R}^d \to \{0, 1, *\}$, which
maps a $d$ dimensional vector onto the set {0, 1, *}. For the sake
of convenience, we drop the subscript $\delta$ from $g$ and refer
to it as $g_{a,b}$, which is defined as follows. Given $a, b$, for any
$x \in \mathbb{R}^d$, let $j = \lfloor (a \cdot x + b)/\delta \rfloor \bmod 4$, where mod denotes the
modulus function. Then

$$ g_{a,b}(x) = \begin{cases} 0 & \text{if } j = 0 \\ 1 & \text{if } j = 2 \\ * & \text{if } j = 1 \text{ or } 3 \end{cases} $$

Having given a formal definition, we give an intuitive
description of this family of hash functions. Consider a partition
of the space $\mathbb{R}^d$ due to the family of hyperplanes
orthogonal to $a$, adjacent planes separated by $\delta$ and
randomly shifted from the origin by $-b$. Then the function
$g_{a,b}(x) : \mathbb{R}^d \to \{0, 1, *\}$ hashes alternate regions to * and the
remaining regions are hashed to 0, 1 alternately. We show in
this section that $\mathcal{G}_\delta$ is a TLSH family with parameters that
are exponentially better than LSH. We show applications of
this scheme in Section 4.
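
The construction above translates almost directly into code. The following Python sketch is an illustration only: the names `make_tlsh` and `signature` are assumptions introduced here, ternions are encoded as the characters '0', '1' and '*', and the values of `d` and `delta` in the usage lines are arbitrary.

```python
import math
import random
from typing import Callable, List, Sequence

def make_tlsh(d: int, delta: float, rng: random.Random) -> Callable[[Sequence[float]], str]:
    """Draw one ternary hash g_{a,b}: R^d -> {'0', '1', '*'}."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]   # a_i ~ N(0, 1), independent
    b = rng.uniform(0.0, 2.0 * delta)             # random shift, as in the text

    def g(x: Sequence[float]) -> str:
        projection = sum(ai * xi for ai, xi in zip(a, x))
        j = math.floor((projection + b) / delta) % 4
        return "0*1*"[j]                          # j=0 -> 0, j=2 -> 1, j odd -> *
    return g

def signature(hashes: List[Callable[[Sequence[float]], str]], x: Sequence[float]) -> str:
    """Concatenate the ternions produced by a list of hash functions."""
    return "".join(g(x) for g in hashes)

rng = random.Random(0)
hashes = [make_tlsh(d=4, delta=1.0, rng=rng) for _ in range(16)]
s = [0.10, 0.20, 0.30, 0.40]
q = [0.11, 0.19, 0.31, 0.40]   # a close neighbour of s
print(signature(hashes, s))
print(signature(hashes, q))
```

Python's `%` operator returns a non-negative result for a positive modulus, so negative projections are handled without a special case.
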

Next, we state the following theorem which is the main 
technical contribution of this paper. 

Theorem 1. For all $\delta > 2c$ the family of ternary hash
functions $\mathcal{G}_\delta$ is a $(1, c, p_1(\delta), p_2(\delta/c))$-TLSH family where



Before proving this theorem, we comment on the improvement
that a $(1, c, p_1(\delta), p_2(\delta/c))$-TLSH family offers
over a $(1, c, p_l, p_u)$-LSH family. One way to compare the
two hashing schemes is to compare the values of $\rho :=
\log p_l/\log p_u$. We note that when $\delta$ is large, both $p_l$ and
$p_u$ are close to 1. In fact, for applications in Section 4 we
set $\delta = O(\sqrt{\log\log n})$. Hence in this regime we can use
the approximations $\log 1/p_l \approx 1 - p_l$, $\log 1/p_u \approx 1 - p_u$.
We get that $\log p_l/\log p_u$ is of order $(1/c^3)\, e^{-\delta^2 (c^2-1)/(2c^2)}$. If we set
$\delta = O(\sqrt{\log\log n})$ then $\rho$ decreases towards 0 as a
function of n. On the other hand, Motwani et al. have
proved that $\log p_l/\log p_u \ge 0.5/c^2$ for any LSH family [30].
Hence we get an unbounded improvement in the parameter
$\rho$ of a locality sensitive hashing family by using TLSH in the
range of the parameter $\delta$ which is of interest.

In order to prove Theorem 1 we first introduce some
notation and prove some subsidiary lemmas. The applications
in Section 4 use the statement of the theorem but are
independent of the proof.

Consider two points $s, q$ in $\mathbb{R}^d$. Let $\mathbf{x} = s - q$, $x = \|\mathbf{x}\|_2$.
Let $\Psi(x)$ denote the "collision probability" of $s$ and $q$, i.e.
$\Psi(x) = \Pr_{\mathcal{G}}[g(s) =_T g(q) \mid s - q = \mathbf{x}]$, and let $\psi(t)$ denote the
collision probability conditioned on the fact that $|a \cdot \mathbf{x}| = t$,
i.e. $\psi(t) = \Pr[g(s) =_T g(q) \mid |a \cdot \mathbf{x}| = t]$. We have $\Psi(x) =
\int_0^\infty \psi(t)\,\pi_x(t)\,dt$ where $\pi_x(t)$ is the density of the random
variable $|a \cdot \mathbf{x}|$. Let $\bar\Psi(x) = 1 - \Psi(x)$ and $\bar\psi(t) = 1 - \psi(t)$.
Let $F$ denote the complementary cumulative distribution
function of $\mathcal{N}(0, 1)$. Let $\phi(y) = e^{-y^2/2}/(y\sqrt{2\pi}) - F(y)$.
The following lemma proves lower and upper bounds on the
collision probability $\Psi$.



Lemma 2. For all S > 0, 



< *(x) < 



Proof. First we recall the definition of stability of random
variables. A distribution $D$ over $\mathbb{R}$ is called p-stable
if there exists $p \ge 0$ such that for any $n$ real numbers
$v_1, v_2, \ldots, v_n$ and i.i.d. random variables $X_1, X_2, \ldots, X_n$ with
distribution $D$, the random variable $\sum_i v_i X_i$ has the same
distribution as the variable $(\sum_i |v_i|^p)^{1/p} X$, where $X$ is a
random variable with distribution $D$. Using the well known fact
that the Normal distribution $\mathcal{N}(0, 1)$ is 2-stable, we conclude
that the random variable $|a \cdot \mathbf{x}|$ is distributed as $x \cdot |\mathcal{N}(0, 1)|$,
which implies $\pi_x(t) = \frac{2}{x\sqrt{2\pi}}\, e^{-t^2/(2x^2)}$ for $t \ge 0$.

If two points $s, q$ are such that $|a \cdot \mathbf{x}| < \delta$ then they
are hashed to matching TCAM values, i.e. $g_{a,b}(s) =_T
g_{a,b}(q)$, since the adjacent hyperplanes are at a distance
of $\delta$ from each other and alternate regions are hashed to
* and 0, 1, 0, 1, .... Hence $\bar\psi(t) = 0$ if $t \in (0, \delta)$. In fact, if
$t \in (0, 2\delta)$, then $\bar\psi(t) = \frac{t-\delta}{2\delta}\, 1_{t>\delta}$, where $1_{t>\delta}$ is the indicator
function (1 if $t > \delta$, 0 otherwise). Symmetry about $2\delta$
implies that if $t \in (2\delta, 4\delta)$, then $\bar\psi(t) = \frac{3\delta-t}{2\delta}\, 1_{t<3\delta}$. Also
note that the function $\bar\psi(t)$ is periodic with period $4\delta$. So
$\bar\psi(t + 4k\delta) = \bar\psi(t)$, for all positive integers $k$.
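
The piecewise form of the mismatch probability can be sanity-checked by simulation. The sketch below is an illustration with assumed names (`mismatch_prob`, `mismatch_rate`): it evaluates the closed form described above (zero on $(0, \delta)$, rising linearly to 1/2 at $2\delta$, symmetric about $2\delta$, period $4\delta$) and compares it with the empirical rate over random shifts $b$ for a fixed projection gap $t$.

```python
import math
import random

def mismatch_prob(t: float, delta: float) -> float:
    """Closed form for the mismatch probability given |a.x| = t."""
    r = t % (4.0 * delta)                 # periodic with period 4*delta
    if r > 2.0 * delta:                   # symmetric about 2*delta
        r = 4.0 * delta - r
    return max(0.0, (r - delta) / (2.0 * delta))

def mismatch_rate(t: float, delta: float, trials: int, rng: random.Random) -> float:
    """Empirical mismatch rate over random shifts b for a fixed gap t."""
    def label(p: float) -> str:
        return "0*1*"[math.floor(p / delta) % 4]
    bad = 0
    for _ in range(trials):
        b = rng.uniform(0.0, 2.0 * delta)     # shift range used in the construction
        u, v = label(b), label(b + t)
        bad += (u != v) and ("*" not in (u, v))
    return bad / trials

rng = random.Random(0)
for t in (0.5, 1.5, 2.0, 2.5, 3.5):
    print(t, mismatch_prob(t, 1.0), round(mismatch_rate(t, 1.0, 20000, rng), 3))
```

Drawing $b$ from $(0, 2\delta)$ rather than a full period gives the same mismatch rate, since shifting by $2\delta$ only swaps the roles of 0 and 1.
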

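Now $\bar\Psi(x) = \int_0^\infty \bar\psi(t)\,\pi_x(t)\,dt \ge \int_\delta^{2\delta} \bar\psi(t)\,\pi_x(t)\,dt$. Using
$\bar\psi(t) = \frac{t-\delta}{2\delta}$ when $t \in (\delta, 2\delta)$, we get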



1" .Is 



— 5 e 2^ , , , 5 , , ,25, 
dt = _ _0 _ 

2o X XX 



and P2iz) 



To prove the latter inequality, we use the fact that $\bar\psi(t) \le
\frac{t-\delta}{2\delta}$, $\forall t > \delta$. Hence we have that



1 e 2^ r /• <s ^ 



□ 
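
For reference, the upper bound can be evaluated in closed form from the definitions above. The following is a reconstruction under those definitions (with $y = \delta/x$, $\bar\psi(t) \le (t-\delta)/(2\delta)$ for $t > \delta$, and $\pi_x$ the density of $x \cdot |\mathcal{N}(0,1)|$); it is a sketch of one way to obtain the bound, not a transcription of the authors' own steps.

```latex
\begin{align*}
\bar\Psi(x) &= \int_0^\infty \bar\psi(t)\,\pi_x(t)\,dt
 \;\le\; \int_\delta^\infty \frac{t-\delta}{2\delta}\,
      \frac{2}{x\sqrt{2\pi}}\,e^{-t^2/(2x^2)}\,dt \\
 &= \frac{1}{\delta\sqrt{2\pi}}\int_{y}^{\infty} (xs-\delta)\,e^{-s^2/2}\,ds
 \qquad (t = xs,\ y = \delta/x) \\
 &= \frac{x}{\delta}\,\frac{e^{-y^2/2}}{\sqrt{2\pi}} - F(y)
 \;=\; \frac{e^{-y^2/2}}{y\sqrt{2\pi}} - F(y) \;=\; \phi(y).
\end{align*}
```

The last line uses $\int_y^\infty s\,e^{-s^2/2}\,ds = e^{-y^2/2}$ and $\int_y^\infty e^{-s^2/2}\,ds = \sqrt{2\pi}\,F(y)$, which gives $\bar\Psi(x) \le \phi(\delta/x)$.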



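Lemma 3 specifies appropriate bounds for the function $\phi$.

Lemma 3. The function $\phi$ is bounded above and below as
follows:

$$ \frac{1}{4\sqrt{2\pi}}\,\frac{e^{-y^2/2}}{y^3} \le \phi(y) \le \frac{1}{\sqrt{2\pi}}\,\frac{e^{-y^2/2}}{y^3} \tag{1} $$

where the first inequality holds if $y > 2$. Hence for all $y > 2$,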



(2) 



Proof. The expansion of the error function using integration
by parts [1] proves (1). Using (1) we get $\phi(y) - \phi(2y) \ge
\left(\frac{1}{4} - \frac{1}{8}e^{-3y^2/2}\right) \frac{1}{\sqrt{2\pi}}\,\frac{e^{-y^2/2}}{y^3}$.
Using $y > 2$ proves the
lemma. □

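Now we return to the proof of the main theorem.
Proof of Theorem 1:

Using Lemma 2 and (1), $\forall x \le 1$, we have

$$ \bar\Psi(x) \le \phi\!\left(\frac{\delta}{x}\right) \le \frac{1}{\sqrt{2\pi}}\,\frac{x^3}{\delta^3}\, e^{-\delta^2/(2x^2)} \le \frac{1}{\sqrt{2\pi}}\,\frac{e^{-\delta^2/2}}{\delta^3} \tag{3} $$

Also using Lemma 2 and $x > c$ and $\delta > 2c$, we have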



*(a;) 



> 
> 

> 



■(f) 
.(f) 



(4) 



This proves Theorem 1.
□ 

Note that using standard bounds on the complementary
cumulative distribution function of the standard normal random
variable $\mathcal{N}(0, 1)$ [1], the bounds on $\phi$ can be improved
as follows: $\forall y$, we have



< Hy) < 



(5) 

It can be verified using standard plotting packages like
Maple or Matlab that for small values of $n$ and $1/\epsilon$ these
bounds are in fact tighter than the bounds presented in (1).
However, it is not clear how these stronger bounds can be
used to obtain an improvement in Theorem 1. Analysis
using these bounds is complicated and moreover, asymptotically
these bounds have the same behaviour as the bounds
in (1). Hence we present the analysis using the simpler bounds
presented in the lemmas above, but we recommend the
use of the tighter bounds for parameter tuning and experiments
as illustrated in Section 5.



4. APPROXIMATE SIMILARITY SEARCH 

In this section we demonstrate the use of $\mathcal{G}_\delta$ to solve the
(1, c)-NN problem and the c-ANNS problem on a set $S$
consisting of $n$ points in $\mathbb{R}^d$ using a TCAM of width $w$ for some
appropriate choice of parameters $\delta$ and $w$. We note that the
results of this section can be extended to solve the (1, c)-SS
problem by requiring the TCAM to output all matching
data points to a query point.

4.1 The (1, c)-NN Problem

In this section we formulate an algorithm to solve the
(1, c)-NN problem. The choice of parameters $\delta$ and $w$ is
specified later.

Algorithm A

• Pre-processing (TCAM Setup): Choose $w$ independent
hash functions $g_1, g_2, \ldots, g_w \in \mathcal{G}_\delta$ where $\mathcal{G}_\delta$ is a
$(1, c, p_1(\delta), p_2(\delta/c))$-TLSH family as defined in Section
3. For every $s_i \in S$, find its TCAM representation
$T(s_i) := (g_1(s_i), g_2(s_i), \ldots, g_w(s_i))$.

• Query lookup: Given a query $q$, find its TCAM representation
$T(q)$ (using the same hash functions). Perform
a TCAM lookup of $T(q)$. If the TCAM returns
a point $s_t$ such that $\|q - s_t\|_2 \le c$, return "YES" and
$s_t$, otherwise return "NO".
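
The two steps of Algorithm A can be sketched end to end in software. The Python code below is a minimal illustration under stated assumptions: the class name `AlgorithmA`, the helper names, and the parameter values in the usage lines are hypothetical, the TCAM is simulated by a first-match scan, and the chosen `w` and `delta` are arbitrary rather than the tuned values analysed in the theorem that follows.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]

def draw_hash(d: int, delta: float, rng: random.Random):
    """One TLSH function, stored as its parameters (a, b)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, 2.0 * delta)
    return a, b

def ternion(h, x: Point, delta: float) -> str:
    a, b = h
    proj = sum(ai * xi for ai, xi in zip(a, x))
    return "0*1*"[math.floor((proj + b) / delta) % 4]

def words_match(u: str, v: str) -> bool:
    return all(p == q or "*" in (p, q) for p, q in zip(u, v))

class AlgorithmA:
    def __init__(self, points: List[Point], w: int, delta: float, c: float, seed: int = 0):
        rng = random.Random(seed)
        d = len(points[0])
        self.points, self.delta, self.c = list(points), delta, c
        self.hashes = [draw_hash(d, delta, rng) for _ in range(w)]
        # Pre-processing: one w-ternion word per data point.
        self.tcam = [self.encode(s) for s in self.points]

    def encode(self, x: Point) -> str:
        return "".join(ternion(h, x, self.delta) for h in self.hashes)

    def query(self, q: Point) -> Tuple[str, Optional[Point]]:
        key = self.encode(q)
        # Software stand-in for the single TCAM lookup (first match wins).
        hit = next((i for i, word in enumerate(self.tcam) if words_match(word, key)), None)
        # One distance computation decides between YES and NO.
        if hit is not None and math.dist(self.points[hit], q) <= self.c:
            return "YES", self.points[hit]
        return "NO", None

rng = random.Random(1)
data = [[rng.uniform(-5, 5) for _ in range(8)] for _ in range(200)]
index = AlgorithmA(data, w=48, delta=3.0, c=2.0)   # illustrative parameters
q = [v + 0.05 for v in data[17]]                   # a query within distance 1 of a stored point
print(index.query(q)[0])
```

The final distance check is what turns a possible false positive from the lookup into a "NO" answer; a false negative (no match for a truly near point) cannot be repaired this way, which is why the parameter choice matters.
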

Intuitively, choosing a large $w$ (i.e. a large number of hash functions)
reduces the possibility of having false positives in the
output but at the same time increases the chances of a false
negative occurring, because any one (or more than one) of the
$w$ TCAM ternions can produce a false negative. Choosing
a large value of $\delta$ reduces the false negative probability but
increases the likelihood of having false positives. We show
in the following theorem that it is possible to tune these
parameters simultaneously to ensure that the false negative
probability is small and the expected number of false positives
is also small.

Theorem 4. Consider a set $S$ consisting of $n$ points in $\mathbb{R}^d$.

1. One TCAM lookup: The (1, c)-NN problem can
be solved by using a TCAM of width $w$ where $w =
O\!\left( \left(\frac{1}{\epsilon}\log\frac{n}{\epsilon}\right)^{c^2/(c^2-1)} \left(\log\left(\frac{1}{\epsilon}\log\frac{n}{\epsilon}\right)\right)^{3/2} (c^2-1)^{-3/2} \right)$ with error
probability at most $\epsilon$ using exactly 1 TCAM lookup and
1 distance computation in $\mathbb{R}^d$.

2. $O(\log(1/\epsilon))$ TCAM lookups: The (1, c)-NN problem
can be solved by a TCAM of width $w =
O\!\left( (\log n)^{c^2/(c^2-1)} (\log\log n)^{3/2} (c^2-1)^{-3/2} \right)$ with error
probability at most $\epsilon$ using $O(\log\frac{1}{\epsilon})$ TCAM lookups and
$O(\log\frac{1}{\epsilon})$ distance computations in $\mathbb{R}^d$.

3. Word size $O(\log n)$: If $c^2 > \log\left(\frac{k}{\epsilon}\log\frac{n}{\epsilon}\right)$ where $k >
1/(1 - p_2(2))$, a constant, the (1, c)-NN problem can be
solved with error probability at most $\epsilon$ using a TCAM
of width $k\log(n/\epsilon)$.

Before proving Theorem 4, we discuss the improvements it
provides over existing methods to solve the (1, c)-NN problem.

1. Constant separation c: Existing approaches to
solve the (1, c)-NN problem can be broadly classified
into three categories depending on their space
requirements as a function of n: polynomial, sub
quadratic, and near linear. Using the dimensionality
reduction approach proposed by Ailon and Chazelle
[3] and ignoring the dependence on $\epsilon$, it is possible
to solve the (1, c)-NN problem with a query time of
$O(d\log d + (c-1)^{-3}\log^2 n)$ using a data structure of
size $d^2 n^{O(1/(c-1)^2)}$, i.e. polynomial in n. The space
requirement of $n^{O(1/(c-1)^2)}$ is optimal in the sense
that any data structure which solves the (1, c)-NN problem
with a constant number of probes must use $n^{\Omega(1/(c-1)^2)}$
space [5]. However, the extremely large space requirement
when c is close to 1 seems to render this approach
impractical. An alternative approach based on the optimal
LSH family [5] proposed by Andoni and Indyk
can be used to solve the (1, c)-NN problem using a
data structure with sub quadratic space requirement
and a constant probability of success. Their approach
has a query time of $O(dn^{1/c^2})$ and a space requirement
of $O(dn^{1+1/c^2}\log n)$, where the dependence on $\epsilon$ has
been ignored. To the best of our knowledge, their algorithm
minimizes the query time when the size of the
data structure is limited to be sub quadratic in n. The
optimal LSH family [5] can also be used to formulate
an algorithm which solves the (1, c)-NN problem with
a data structure which is near linear in size and has
a query time of $dn^{O(1/c^2)}$, using the algorithm proposed
by Panigrahy [32]. These upper bounds reveal
the trade-off involved between the space requirement
and the query time while solving the (1, c)-NN problem
using LSH. In contrast with these results, using Theorem
4.1 we can formulate a TCAM based data structure
which has $O\!\left( \left(\frac{1}{\epsilon}\log\frac{n}{\epsilon}\right)^{c^2/(c^2-1)} \left(\log\left(\frac{1}{\epsilon}\log\frac{n}{\epsilon}\right)\right)^{3/2} (c^2-1)^{-3/2} \right)$
word size and solves the (1, c)-NN problem in
just one TCAM lookup and one distance computation
in $\mathbb{R}^d$. Ignoring the dependence on $\epsilon$, we conclude
that a TCAM based data structure requires
word size $O\!\left( (\log n)^{c^2/(c^2-1)} (\log\log n)^{3/2} (c^2-1)^{-3/2} \right)$ to
solve the (1, c)-NN problem with query time O(1).
The width of the TCAM varies with $\epsilon$ as $\epsilon^{-c^2/(c^2-1)}$,
which leads to large values of the width when $\epsilon$ is
small. One workaround is to use a TCAM of width
$O\!\left( (\log n)^{c^2/(c^2-1)} (\log\log n)^{3/2} (c^2-1)^{-3/2} \right)$ and repeat
the algorithm $O(\log(1/\epsilon))$ times (Theorem 4.2). For instance,
$n = 10^6$ and $c = 2$ requires a TCAM of width
3.3K bits and 1 lookup per query to succeed with probability
90% using the tight bounds in (5). But allowing
4 lookups per query, the width of the TCAM required
can be brought down to 1.7K bits. We explore the
trade-off between the width of the TCAM and the accuracy
of algorithm A while using data sets consisting of
$n = 10^6$ points in a practical setting in Section 5.

In fact, Panigrahy et al. [33] showed that any data
structure in the cell probe model which uses a single
probe to solve the (1, c)-NN problem with constant
probability has a space requirement of $n^{1+\Omega(1/c^2)}$.
Hence a data structure which uses near linear space
needs to be probed $\Omega(\log n/\log\log n)$ times. Clearly,
the TCAM based scheme, which uses space $O(n)$ and
query time O(1), beats this lower bound by implementing
parallel operations which do not conform with the
cell probe model of computation.

2. Word size $O(\log n)$: Consider solving the (1, c)-NN
problem using a RAM of word size $w = O(\log n)$ which
uses $w$ independent hash functions from the optimal
$(1, c, p_l, p_u)$-LSH family [5]. To solve the (1, c)-NN
problem with error probability at most $\epsilon$, we need
the probability of a false negative to be at most $\epsilon$,
i.e. $1 - p_l^w \le \epsilon$, and the probability of a false positive
to be at most $\epsilon$, i.e. $p_u^w \le \epsilon/n$ (since there are at
most n points with respect to which a false positive
can occur). This implies that $\frac{\log p_u}{\log p_l} \ge \left(\frac{1}{\epsilon} - 1\right)\log\frac{n}{\epsilon}$.
Hence $\frac{\log p_u}{\log p_l} = \Omega\!\left(\frac{1}{\epsilon}\log(n/\epsilon)\right)$. Using the fact that
$\frac{\log p_l}{\log p_u} \ge \frac{1}{2c^2}$ [30] for any $(l, u, p_l, p_u)$-LSH family, we
get $c^2 = \Omega\!\left(\frac{1}{\epsilon}\log\frac{n}{\epsilon}\right)$. Hence the "granularity" achieved by
LSH (ignoring $\epsilon$) in this case is $\Omega(\sqrt{\log n})$. On the other
hand, using Theorem 4.3 with a word size of $O(\log n)$,
algorithm A can solve the (1, c)-NN problem with error
probability at most $\epsilon$ if $c = \Omega\!\left(\sqrt{\log\left(\frac{1}{\epsilon}\log\frac{n}{\epsilon}\right)}\right)$.
Thus, ignoring $\epsilon$, the granularity achieved by the TCAM
based scheme is $\Omega(\sqrt{\log\log n})$. Hence we see that the use
of the TLSH family brings about an exponential improvement
in the "granularity" of a (1, c)-NN problem.

Again, we note that these huge improvements are brought
about by the use of a TCAM, which has a lot of inherent
parallelism, and hence the lower bounds mentioned before
do not apply. Next we proceed to prove Theorem 4.

Proof of Theorem 4: First we make the following claims
regarding the choice of parameters $\delta$, $w$ which prove the
theorem.

1. One TCAM lookup:

J^log(^log(^))) 
Tog(2ji), then algorithm 



we choose 
and w = 
A solves the 



1-P2(S/C) 

(l,c)-NN problem with error probability at most e. 

2. 0(log(l/e)) TCAM lookups: Choosing 5 and w as 
in [Theorem 4,1] with error probability at most 1 /2 and 
repeating algorithm A log (1/e) times solves the (1, c)- 
NN problem with error probability at most e. This can 
in fact be implemented using a single TCAM by using 
the first 0(log (log (1/e))) bits of the TCAM to code 
the version number of the 0(log (1/e)) different data- 
structures to be used to solve the (1, c)-NN problem. 



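3. Word size $O(\log n)$: Choosing $\delta = \alpha c$ and
$w = k\log(n/\epsilon)$, where $\alpha$ is such that $k = \frac{1}{1 - p_2(\alpha)}$,
implies that algorithm A solves the $(1,c)$-NN problem with error
probability at most $\epsilon$ when $c^2 > \log\left(\frac{k}{\epsilon}\log\frac{n}{\epsilon}\right)$.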
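The version-number idea in claim 2 can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, ternary words are modelled as Python strings over `0`, `1`, `*`, and a fixed-width binary prefix is assumed as the encoding of the version number.

```python
def version_bits(repetitions):
    """Number of prefix bits needed to tell `repetitions` structures apart."""
    return max(1, (repetitions - 1).bit_length())


def tag_entries(words_per_version):
    """Prefix every ternary word with the binary index of its data structure.

    `words_per_version[j]` holds the ternary words (strings over '0', '1', '*')
    produced by the j-th independent repetition of the hashing scheme.
    """
    bits = version_bits(len(words_per_version))
    tagged = []
    for version, words in enumerate(words_per_version):
        prefix = format(version, "0{}b".format(bits))
        tagged.extend(prefix + word for word in words)
    return tagged


def tag_query(version, repetitions, query_word):
    """Build the lookup key that targets only repetition `version`."""
    prefix = format(version, "0{}b".format(version_bits(repetitions)))
    return prefix + query_word
```

Because the prefix contains no wildcard, a key built for repetition j can only match entries stored for repetition j, so all repetitions can share one TCAM at the cost of a few extra bits of width.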

Next we prove these claims sequentially. Let $S(q, c)$ denote
the set of points $\{s \in S : \|s - q\|_2 > c\}$. We prove the
theorem by analyzing the false positive and false negative
cases. For any query point $q \in \mathbb{R}^d$, note that algorithm A
will solve the $(1,c)$-NN problem correctly if the following
two properties hold:

P1: (No false negative matches) If there exists an $s_l$ such
that $\|s_l - q\|_2 \le 1$, then the TCAM representations of $s_l$
and $q$ match, i.e. $T(s_l) =_t T(q)$.

P2: (No false positive matches) For any $s_u \in S(q, c)$, the TCAM
representations of $s_u$ and $q$ do not match, i.e. $T(s_u) \ne_t T(q)$.
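To make the matching relation in P1 and P2 concrete, the following sketch simulates a ternary match in software. It assumes that two positions match when the symbols are equal or when either one is the wildcard `*`, and that a lookup returns the first matching entry; both conventions and the function names are illustrative assumptions rather than details taken from the paper.

```python
WILDCARD = "*"


def ternary_match(a, b):
    """Return True if two equal-width ternary words match at every position.

    Assumption: a position matches when both symbols are equal or when
    either symbol is the wildcard '*'.
    """
    if len(a) != len(b):
        raise ValueError("ternary words must have the same width")
    return all(x == y or WILDCARD in (x, y) for x, y in zip(a, b))


def tcam_lookup(entries, query):
    """Software stand-in for a TCAM: index of the first matching entry, or None."""
    for index, entry in enumerate(entries):
        if ternary_match(entry, query):
            return index
    return None
```

In this picture, P1 says that the entry for a near point is matched by the query, and P2 says that no entry for a far point is.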

1. We will show in the following analysis that it is possible
to choose the parameters $\delta$ and $w$, such that

w = 0((ilogf)^(log(ilogf))^(c2 - l)-i) and 
both properties P1 and P2 each hold with probability at
least $1 - \epsilon$. This implies that algorithm A succeeds
with probability at least $1 - 2\epsilon$. Rescaling $\epsilon$ by 1/2, we
can conclude that algorithm A succeeds with probability
at least $1 - \epsilon$.

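Choose $\delta$ and $w$ such that

$\frac{1 - p_2(\delta/c)}{1 - p_1(\delta)} = \frac{1}{\epsilon}\log\frac{n}{\epsilon}, \qquad w = k\log(n/\epsilon), \qquad \text{where } k = \frac{1}{1 - p_2(\delta/c)}. \qquad (6)$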



• The choice of $k$ is such that $p_2(\delta/c) = 1 - \frac{1}{k}$. This
implies that the false positive probability with respect
to any particular point in $S(q, c)$ is at most
$(p_2(\delta/c))^w = \left(1 - \frac{1}{k}\right)^{k\log(n/\epsilon)} \le \frac{\epsilon}{n}$.
Hence the expected number of false positives in the output of
algorithm A is at most $\epsilon$. By Markov's inequality,
the probability that the output of the TCAM is
a false positive match is at most $\epsilon$. Hence
property P2 holds with probability at least $1 - \epsilon$.

• Using (6), we get $p_1(\delta) = 1 - \epsilon/w$. Hence the probability
of making a false negative error on any one ternion
is at most $\epsilon/w$. A union bound over the $w$ ternions
implies that the probability of a false negative in the
output of the TCAM is at most $\epsilon$, i.e. P1 holds
with probability at least $1 - \epsilon$.

Now using (|6]), we get 

!^ = ^log(J,lo; 
k = 

0((ilog(7))" ' 



(log(ilog(^)))5(c2-l)- 

w = 

0((7log(7))^(log(7l«g(7)))"(c'-l)" 



(7) 

2. Using error probability 1/4 in the analysis of [Theorem 
4,1], we get that the algorithm succeeds with probabil- 
ity at least 1/2 and the width of the TCAM required is 

2 

given by w = O ( (log n)"'^-^ (log log n)^). If this pro- 
cess is repeated ©(logj (1/e)) times, the probability of 
success can be amplified to 1 — e. 

3. Choose $\delta = \alpha c$ and $w = k\log(n/\epsilon)$, where $\alpha$ is such
that $k = \frac{1}{1 - p_2(\alpha)}$. The condition $k > \frac{1}{1 - p_2(2)}$ implies
that $\alpha > 2$ and thus $\delta > 2c$.

• Again, the choice of $k$ is such that $p_2(\delta/c) = 1 - \frac{1}{k}$.
Repeating the analysis of [Theorem 4,1] we get
that property P2 holds with probability at
least $1 - \epsilon$.



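Footnote (on the choice of $\delta$ in (6)): Note that $\frac{1 - p_2(\delta/c)}{1 - p_1(\delta)}$ is an increasing function of $\delta$ for a fixed
$c$, and hence for any $n$, $\epsilon$ there exists a $\delta$ which satisfies this condition.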


• Now pi{S) = pi(ac) = 1 - > 1 - e " . 

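Now if $c^2 > \log\left(\frac{k}{\epsilon}\log\frac{n}{\epsilon}\right)$, i.e. $e^{-c^2} < \epsilon/w$, then we have
$p_1(\delta) > 1 - \epsilon/w$. Again, similar to [Theorem 4,1],
this implies that P1 holds with probability
at least $1 - \epsilon$. Hence algorithm A solves the $(1,c)$-Near
Neighbor problem with an error probability
of at most $\epsilon$ using a TCAM of width $k\log(n/\epsilon)$ when
$c^2 > \log\left(\frac{k}{\epsilon}\log\frac{n}{\epsilon}\right)$.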

□ 
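The parameter choice in the proof can be checked numerically. The sketch below takes the per-ternion far-point match probability $p_2(\delta/c)$ as a plain number, because the closed form of $p_2$ is not restated in this section. It computes $k$ and $w$ as in (6), using natural logarithms and rounding $w$ up to an integer. The function names and the example value 0.9 are assumptions made for illustration.

```python
import math


def tcam_width(p2, n, eps):
    """Return (k, w) with k = 1 / (1 - p2) and w = ceil(k * ln(n / eps))."""
    k = 1.0 / (1.0 - p2)
    w = math.ceil(k * math.log(n / eps))
    return k, w


def expected_false_positives(p2, n, w):
    """Upper bound n * p2**w on the expected number of false positive matches."""
    return n * p2 ** w


def required_p1(eps, w):
    """Per-ternion near-point match probability needed for the union bound."""
    return 1.0 - eps / w


if __name__ == "__main__":
    n, eps, p2 = 10**6, 0.05, 0.9  # p2 = 0.9 is an illustrative value only
    k, w = tcam_width(p2, n, eps)
    print("k =", k, "w =", w)
    print("expected false positives <=", expected_false_positives(p2, n, w))
    print("required p1 =", required_p1(eps, w))
```

The false positive bound holds because $(1 - 1/k)^k \le e^{-1}$, so raising it to the power $\ln(n/\epsilon)$ gives at most $\epsilon/n$ per far point.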

4.2 The c-ANNS problem 

Consider a data set $S$ consisting of $n$ points and a query
point $q$. Let $r_0$ and $r_{\max}$ denote the smallest and largest
possible distances from $q$ to its nearest neighbor in $S$ and
let $m = \lceil 2\log_c(r_{\max}/r_0) \rceil$. To solve the c-ANNS problem we
use a simple (but weak) reduction [22, 18] from c-ANNS to
$m$ instances of the $(1, \sqrt{c})$-NN problem. Next, we describe the
pre-processing step. Let the parameters $\delta$, $w$ be chosen as in
the analysis of [Theorem 4,1] such that the error probability
in solving a $(1, \sqrt{c})$-NN problem on $S$ is at most $\epsilon/m$.
For each $i$ in $1 \ldots m$:

1. Let $l_i = r_0 c^{i/2}$ and $u_i = \sqrt{c}\, l_i$.

2. Scale down the coordinates of the data points by $l_i$ and
find ternary hash representations of the data points
using a $(1, \sqrt{c}, p_1(\delta), p_2(\delta/\sqrt{c}))$-TLSH family.

3. Store the hash representations in the TCAM of width
$w$, in order of increasing $i$.

The TCAM lookup of the hash representation of $q$, i.e.
$T(q)$ (using the same hash functions), is output as the
c-approximate nearest neighbor. Let $l^*$ denote the distance of
$q$ to its nearest neighbor in $S$, i.e. $l^* = \min_{s \in S} \|s - q\|_2$,
and let $i^*$ denote the first $i$ in $1 \ldots m$ for which $l_i > l^*$. Then the
correct solution of the $(l_{i^*}, u_{i^*})$-NN problem yields the c-approximate
nearest neighbor of $q$. This is because $l^* \ge l_{i^*-1}$ and the
output is at a distance of at most $u_{i^*} = c\, l_{i^*-1} \le c\, l^*$ from
$q$. The choice of parameters $\delta$ and $w$ is such that each
$(l_i, u_i)$-NN problem is solved with an error probability of
at most $\epsilon/m$. Hence the probability of making an error in
solving any one of the $m$ $(l_i, u_i)$-NN problems is at most
$\epsilon$. This approach can be generalized to using TCAMs with
smaller widths but $O(m\log(1/\epsilon))$ lookups per query point,
in a manner similar to [Theorem 4,2]. As mentioned before,
Tao et al. [IH] describe a method to solve the c-ANNS
problem without solving a sequence of near neighbor problems,
using the computation of longest common prefixes of
binary strings. It would be interesting to find out if their
approach can be adapted for use with TCAMs in order to
avoid solving a sequence of $(l_i, u_i)$-NN problems.
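The scale sequence behind this reduction can be sketched as follows. The code assumes geometric scales $l_i = r_0 c^{i/2}$ with $u_i = \sqrt{c}\, l_i$ and $m = \lceil 2\log_c(r_{\max}/r_0) \rceil$; these are a reading of the construction above, and the function names are hypothetical.

```python
import math


def scales(r0, r_max, c):
    """Return the list [l_1, ..., l_m] with l_i = r0 * c**(i / 2)."""
    m = math.ceil(2 * math.log(r_max / r0, c))
    return [r0 * c ** (i / 2) for i in range(1, m + 1)]


def first_scale_above(ls, l_star):
    """Return the 1-based index i* of the first scale with l_i > l_star, or None."""
    for i, l in enumerate(ls, start=1):
        if l > l_star:
            return i
    return None


if __name__ == "__main__":
    c = 2.0
    ls = scales(r0=1.0, r_max=10.0, c=c)
    i_star = first_scale_above(ls, l_star=3.0)
    u_star = math.sqrt(c) * ls[i_star - 1]
    # The point returned at scale i* lies within u_star, which is at most c * l_star.
    print(i_star, u_star, c * 3.0)
```

Since consecutive scales differ by a factor of $\sqrt{c}$, the upper radius at scale $i^*$ equals $c$ times the previous lower radius, which is what bounds the approximation factor by $c$.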

5. SIMULATIONS AND EXPERIMENTS 

In this section we explore the trade-off between the width
of the TCAM and the performance of the algorithm A. In
particular we show via simulations that a TCAM of width
288 bits solves the (1,2)-NN problem on practical and
artificially generated (but illustrative) data sets consisting of
1M points in 64 dimensional Euclidean space. Finally, we
also design an experiment with TCAMs inside an enterprise
ethernet switch (Cisco Catalyst 4500) to show that TLSH
can be used to configure a TCAM to perform 1.488 million
queries per second per 1 Gbps port.

Footnote (on the reduction used in Section 4.2): The weakness of this
reduction is because of the possibility that $m$ might be large or
unbounded. We remark that the approach in Section 4.2 cannot be
trivially modified to use the "adaptive" reduction of c-ANNS to
$O(\log(n/\epsilon))$ instances of the $(l, c)$-NN problem proposed by
Har-Peled [22].

5.1 Simulations: 

We evaluate our algorithm on 3 specific data sets with
query points generated artificially. Each data set contains a
million points chosen from a 64 dimensional Euclidean space
($n = 10^6$, $d = 64$). The corresponding query set contains
1K points generated from a 64 dimensional Euclidean space.
We list the data sets we used, ordered from the most "benign"
to the "hardest", as follows.

"Random" data: We chose data points generated uniformly
at random from the d-dimensional cube
$C_d = \{-2/\sqrt{d}, 2/\sqrt{d}\}^d$ for this data set. We chose half of
the query points by selecting a random data point $s$
(say) and choosing a point uniformly at random from
the surface of a sphere of radius $l$ centered at $s$. We
generated the remaining half of the query points uniformly
at random from $C_d$. The size of the cube as well as
the choice of the query points ensured that a significant
fraction of the query point - data point pairs were
separated by a distance either at most $l$ or at least $cl$.
(Note that query point - data point pairs such that the
distance between them lies in between $l$ and $cl$ do not
contribute to either false positives or false negatives
and can safely be ignored.)

Wikipedia data set: The second data set we used is the
semantically annotated snapshot of the English Wikipedia
(SW v.2) data set, obtained from Yahoo!. It contained
a snapshot of the English Wikipedia (from 2005) processed
with publicly available NLP tools. We computed
the "simHash" signatures [12, 27] and embedded
the signatures in a Euclidean space. The query
points were generated by randomly choosing 1K data
points of the data set and flipping a few (at most 3)
randomly chosen bits of their simHash signatures.

"Threshold" data: The third data set we used was artificially
designed to maximize the number of false positives
and false negatives. A single query point was
generated uniformly at random from $C_d$. In order to maximize
the number of false negatives and the number of
false positives seen, half of the data points were chosen
to lie on the surface of a sphere $S_1(q)$ (say) of radius
$l$ centered at $q$ and the remaining half were chosen to
lie on the surface of a sphere $S_2(q)$ (say) of radius $cl$
centered at $q$. (The data points were on the "threshold"
of being similar and dissimilar to $q$.) This setup was
repeated for each of the 1K query points and the average
values of the false negatives and false positives
observed are reported.
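The sphere-based constructions above can be sketched with the standard library alone. The code draws a uniformly random direction by normalizing a Gaussian vector, a common technique that the paper does not specify, and builds a small "Threshold"-style data set around a query. All function names are illustrative.

```python
import math
import random


def point_on_sphere(center, radius, rng):
    """Sample a point uniformly from the sphere of given radius around `center`."""
    g = [rng.gauss(0.0, 1.0) for _ in center]
    norm = math.sqrt(sum(x * x for x in g))
    return [c + radius * x / norm for c, x in zip(center, g)]


def threshold_data(q, n, l, c, rng):
    """Half of the n points at distance l from q, the other half at distance c * l."""
    half = n // 2
    near = [point_on_sphere(q, l, rng) for _ in range(half)]
    far = [point_on_sphere(q, c * l, rng) for _ in range(n - half)]
    return near + far


def distance(a, b):
    """Euclidean distance between two points given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.0] * 64
    points = threshold_data(q, n=10, l=1.0, c=2.0, rng=rng)
    print([round(distance(q, p), 6) for p in points])
```

The same `point_on_sphere` helper would also serve for the "Random" data set, where half of the queries are placed at distance $l$ from a randomly chosen data point.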

Footnote (on the Euclidean embedding): We used an appropriate scaling in
order to ensure that a significant fraction of the query point - data
point pairs are such that the distance between the query point and the
data point was either at most $l$ or at least $cl$.
Footnote (on the bit flips): The perturbation was chosen according to the
experimental study of near duplicate detection in web documents [27].



Apart from presenting the number of false positives observed
per query and the false negative rate (fraction of false
negatives observed) as a measure of accuracy, we also report
the F-score or the $F_1$-measure [35] of our algorithm, which
is just the harmonic mean of precision and recall. Precision
is defined as the fraction of retrieved documents that are
relevant. Recall is the fraction of relevant documents that
are retrieved. Similar to precision and recall, the F-score lies
in the range [0, 1], and intuitively a high value of the F-score
implies high values of precision and recall.
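As a concrete reading of these definitions, a minimal sketch of the F-score computed from raw counts follows. The function name and the convention of returning 0 when there are no true positives are assumptions.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_positives == 0:
        return 0.0  # convention assumed here: no relevant item was retrieved
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

For example, 8 true positives with 2 false positives and 2 false negatives give precision and recall of 0.8 each, and hence an F-score of 0.8.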

To explore the trade-off between accuracy and TCAM
width, we choose the TCAM widths in the range w =
32, 64, 96, 128, 144, 160, 192, 224, 256, 288, 320 bits. (Note
that commercially available TCAMs have 72, 144 and 288 bit
configurations.) As we are interested in an accurate algorithm,
as a design choice we set the tolerance of the
false negative rate at $\epsilon_n = 5\%$ and minimize the number of
false positives generated under this constraint. For each $w$,
we choose the $\delta$ for which the least number of false positives is
observed while ensuring that the false negative rate is below
5%. For the F-score, we chose the $\delta$ which maximizes the
F-score using a binary search. We illustrate this procedure for
a TCAM of width $w = 288$ bits in Figure 2. As
expected, increasing $\delta$ decreases the false negative rate but
increases the number of false positives and thus generates
a bell shaped curve for the F-score. The figure shows that
there exists an optimal choice of $\delta$ which minimizes the false
positives or maximizes the F-score. We refer to this choice
of $\delta$ as $\delta_{opt}$.
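The constrained selection of $\delta$ described above can be sketched as a filter followed by a minimum. The record layout (`delta`, `fn_rate`, `false_positives`) and the function name are hypothetical. The sketch scans a list of already measured candidates, so it does not reproduce the binary search that the text mentions for the F-score.

```python
def choose_delta(measurements, max_fn_rate=0.05):
    """Pick the delta with the fewest false positives among admissible candidates.

    `measurements` is a list of dicts with keys 'delta', 'fn_rate' and
    'false_positives', one per candidate delta at a fixed TCAM width.
    Returns None when no candidate meets the false negative constraint.
    """
    feasible = [m for m in measurements if m["fn_rate"] <= max_fn_rate]
    if not feasible:
        return None
    return min(feasible, key=lambda m: m["false_positives"])["delta"]
```

Because the false negative rate falls and the false positive count rises as $\delta$ grows, the selected value is typically the smallest $\delta$ that satisfies the false negative constraint.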

5.2 Model 

The process just described for arriving at the optimal
choice of $\delta$ involves the use of the query points. Hence, the
optimal value of $\delta$ cannot be precomputed given just the
database. However, it turns out that an estimate of the
distribution of query points alone is enough to give a good
approximation to the optimal $\delta$. Let $n_1$ denote an estimate of the number
of data points which are "similar" to the query. Let $n_2$ denote
an estimate of the number of data points "dissimilar" to
the query. Consider a model containing a single query point
$q$ with $n_1$ points on $S_1(q)$ and $n_2$ points on $S_2(q)$. Then
for a TCAM of width $w$, using the expressions for $p_1(\delta)$ and
$p_2(\delta/c)$, it is possible to theoretically calculate the expected
values of the false negative rate, the number of false positives per
query and the expected F-score for this model and use them
as predictions for choosing $\delta$. For each data set, we use the
average number of similar and dissimilar points to a single
query (by averaging over the 1K queries) as $n_1$ and $n_2$ in the
model. The observed values of the false negative rate, the number
of false positives per query, and the F-score as $\delta$ is varied
were found to closely match those predicted by the model.
For example, a comparison of these quantities observed for
the Wikipedia data set as $\delta$ is varied with those predicted
by a model for this data set is shown in Figure 2.
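A minimal sketch of such a model is given below. It assumes that the $w$ ternions behave independently, so that a near point matches with probability $p_1(\delta)^w$ and a far point with probability $p_2(\delta/c)^w$. The two per-ternion probabilities are passed in as numbers rather than computed from $\delta$. The F-score is evaluated at the expected counts, which only approximates the expected F-score. The function name is hypothetical.

```python
def model_predictions(p1, p2, w, n1, n2):
    """Predict accuracy for n1 near points and n2 far points at TCAM width w.

    p1: per-ternion match probability for a point on S_1(q), i.e. p_1(delta)
    p2: per-ternion match probability for a point on S_2(q), i.e. p_2(delta / c)
    """
    match_near = p1 ** w          # a near point matches on all w ternions
    match_far = p2 ** w           # a far point matches on all w ternions
    fn_rate = 1.0 - match_near    # expected fraction of near points missed
    false_positives = n2 * match_far
    true_positives = n1 * match_near
    retrieved = true_positives + false_positives
    precision = true_positives / retrieved if retrieved > 0 else 0.0
    recall = match_near
    denom = precision + recall
    f = 2 * precision * recall / denom if denom > 0 else 0.0
    return {"fn_rate": fn_rate, "false_positives": false_positives, "f_score": f}
```

Sweeping $\delta$ (and hence the two probabilities) through this function yields predicted curves of the kind that are compared against the observed values.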

5.3 Results and discussion 

We observe that the performance of our algorithm, i.e. the
F-score and the number of false positives generated, improves
as the width (size) of the TCAM is increased, as seen in
Figures 3 and 4. As seen in Figure 3, the improvements in
the F-score follow the law of diminishing returns for increasing
TCAM widths, and an F-score better than 0.95 is obtained
using a TCAM of width 288 bits for all the data sets
considered, which intuitively indicates high values of precision
and recall. Secondly, we note that while tolerating a false
negative rate of 5%, only 1 false positive was observed per
query for the "Random" data set. For the Wikipedia data
set, the number of false positives observed per query was
14, while for the "Threshold" data set 51 false positives were
observed, with the false negative rate below the threshold
of $\epsilon_n = 5\%$. These simulation results suggest that a TCAM
of width 288 bits can be used to solve the (1,2)-NN problem
on data sets consisting of a million points.

Figure 2: Variation of performance measures observed for the Wikipedia data set with $\delta$ and the corresponding
model predictions as described in Section 5.2 using a TCAM of width w = 288 bits. As we can see in this
figure, increasing $\delta$ decreases the false negatives but increases the false positives. Thus there exists an optimal
choice of $\delta$ which minimizes false positives when the false negative rate is below some threshold and maximizes
the F-score. This figure also shows the comparison between the Wikipedia data set and its corresponding
model, using a TCAM of width w = 288 bits.

Figure 3: Variation of the F-score vs the width of the TCAM: As we can observe from this figure, the F-score of
our algorithm increases with the width of the TCAM used for different data sets. This figure also shows that
on a range of data sets, use of a TCAM of width 288 bits results in a method with an F-score of approximately
0.95. Finally, this figure also shows there is only a slight loss in performance if $\delta$ is precomputed according to
the model, as opposed to being chosen optimally.

Figure 4: Number of false positives per query vs the width of the TCAM with the false negative rate capped at
5%. This figure shows that the number of false positives per query drops rapidly as the width of the TCAM
is increased with the false negative rate capped at 5%. This suggests that on practical data sets, use of a
TCAM of width 288 bits generates few false positives.

We seek solutions in which the false negative rate is at most
5% and the number of false positives generated per query
by the method is at most 10. For the "Random" data set,
the use of a 288 bit TCAM actually satisfies these demands,
while for the Wikipedia data set, the use of a 288 bit TCAM
comes very close to matching these requirements. Since 288
bit wide TCAMs containing 0.5M entries are available in the
market, our method represents a novel yet easy solution to
the problem of similarity search in high dimensions. Even
though a larger number of false positives (51) is generated
by using a 288 bit wide TCAM on the "Threshold" data
set, we note that this data set was artificially constructed to
maximize the number of false positives and false negatives.
We conjecture that the property of all the similar points to
a query being on the "threshold" of being similar, and all the
dissimilar points being on the "threshold" of being dissimilar,
is unlikely to be observed in practical data sets. We would
also like to mention here that it is possible to generate a
worst case input distribution for the F-score which has just
a single point similar to a given query point (on the sphere
$S_1(q)$) and all the remaining data points dissimilar to
the query (on the sphere $S_2(q)$). Running the simulations
on this data set, we observed that the performance was not
much worse than the results presented in this section, even
though this property (of having a single similar data point
to a query) is unlikely to be observed in real data sets.

5.4 Preliminary experimental validation 

In this section, we demonstrate that the simulations of
TLSH are realistic, and that the TLSH algorithm can be
made to work with existing TCAM based products at very
high speeds. For this, we need to choose an appropriate platform.
Although it is possible to use a standalone TCAM
platform, managing the TCAM in software is non-trivial.
For a preliminary validation, we leverage a Cisco Catalyst
4500 (Cat4K) series enterprise switch [3] which uses TCAMs
for a variety of purposes including implementing access control
lists (ACLs). In one second, it can support up to a
billion TCAM lookups and switch 250 million packets.

Our simple observation is as follows. For validating a
64-bit TCAM lookup, we map it to an IP address lookup
in a 64-bit IPv4 access control list. For example, a 64-bit
lookup key could be represented as a 32-bit IPv4 source and
a 32-bit IPv4 destination address. This query is embedded
within the IPv4 source and destination address fields of an
IP packet and injected into the Cisco switch. Access control
lists involve TCAM lookups. The TCAM database is similarly
represented as entries of an ACL with a permit action
for matches, i.e. if the TCAM matches a given query, the
action would be to permit the IP packet and if there is no
match, the action would be to drop the packet. Thus, all
egress packets represent queries that had a TCAM hit, as
shown in Figure 5.
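The mapping from a 64-bit ternary word to an ACL entry can be sketched as follows. The output imitates the general form of a Cisco IOS extended ACL line (`permit ip source source-wildcard destination destination-wildcard`, where a 1 bit in a wildcard mask means "don't care"). This is an illustration only: it is not the configuration used in the experiment, and whether a given platform accepts non-contiguous wildcard masks is an assumption here. The function name is hypothetical.

```python
import ipaddress


def ternary_to_acl(word):
    """Turn a 64-symbol ternary word over '0', '1', '*' into one ACL-style line.

    The first 32 symbols become the source address and wildcard mask, the
    last 32 the destination address and wildcard mask.
    """
    if len(word) != 64 or set(word) - set("01*"):
        raise ValueError("expected 64 symbols drawn from '0', '1', '*'")

    def half(bits):
        # Wildcard positions are written as 0 in the address and 1 in the mask.
        value = int(bits.replace("*", "0"), 2)
        wildcard = int("".join("1" if b == "*" else "0" for b in bits), 2)
        return ipaddress.IPv4Address(value), ipaddress.IPv4Address(wildcard)

    src, src_wc = half(word[:32])
    dst, dst_wc = half(word[32:])
    return "permit ip {} {} {} {}".format(src, src_wc, dst, dst_wc)


if __name__ == "__main__":
    word = "00001010" + "0" * 16 + "*" * 8 + "11000000" + "*" * 24
    print(ternary_to_acl(word))
```

A query is then a packet whose source and destination addresses carry the 64-bit key, and the packet is forwarded exactly when some entry of this form matches it.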



We use a high speed commercial traffic generator (from
IXIA). Though the Cat4K switch can support up to 384
1 Gb/s ports, we use two 1 Gb/s ports for this experiment,
and connect these to two ports of the IXIA, which are
programmable and can inject traffic with specified IP addresses.
We pass packets from one port and detect egress packets on
the other via the switch. A switch learns the source and
the destination for the given hardware MAC addresses of a
packet (that we set manually) and switches these packets in
hardware. We inspect the egress packets' IP addresses to
determine which queries hit the TCAM. To ensure the speed,
we send IP packets (representing queries) at wire speed (i.e.
about 1.5 million packets per second per 1 Gb/s port).
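The wire-speed figure can be reproduced with a short calculation, assuming minimum-size 64-byte Ethernet frames plus 20 bytes of per-frame overhead (8 bytes of preamble and start-of-frame delimiter and a 12-byte inter-frame gap). The frame size is an assumption; the text does not state it.

```python
def max_packets_per_second(link_bps, frame_bytes=64, overhead_bytes=20):
    """Packets per second on a link when every frame occupies frame + overhead bytes."""
    bits_per_packet = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_per_packet


if __name__ == "__main__":
    # 1 Gb/s with minimum-size frames: 10**9 / 672 bits per packet.
    print(max_packets_per_second(10**9))
```

Under these assumptions a 1 Gb/s port carries about 1.488 million packets per second, which agrees with the queries-per-second figure quoted at the start of this section.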

We validated several randomly generated data sets, for
32 and 64 bit TLSH lookups. For each data set, we randomly
generate negative, positive and false positive queries
and then inspect the egress packets' IP addresses. We observe
that for every positive or false positive query (according to
TLSH), we do indeed have an egress packet with the
corresponding IP address. For every negative query, we never
detect the corresponding IP packet at egress. We believe
that this simple experimental setup is novel as it allows us
to rapidly demonstrate the performance argument without
the overheads of managing TCAMs!



6. RELATED WORK 

Early methods to solve similarity search problems in high
dimensions used the space partitioning approach in order to
solve the exact nearest neighbor problem by reducing the
candidate set of data points for a given query, using branch
and bound techniques. These include the well-known k-d tree
approach [8], cover trees [10] and navigating nets [23]. However,
an experimental study [32] has shown that approaches
based on space partitioning scale poorly with the number
of dimensions $d$, and in fact when $d > 10$ they performed
worse than a brute force linear scan for some specific data
sets (curse of dimensionality).

The locality sensitive hash (LSH) family was proposed by Indyk
and Motwani [25] to solve the c-ANN problem with
space requirement and query time polynomial in the size
of the database and the number of dimensions. Given
parameters $l$, $u$, $p_l$ and $p_u$, an $(l, u, p_l, p_u)$-LSH family of hash
functions has the following property: the probability that
two points separated by a distance at most $l$ are hashed to
the same value is at least $p_l$, and the probability that two points
separated by a distance at least $u$ are hashed to the same
value is at most $p_u$. Gionis et al. [18] showed a framework
based on an $(l, u, p_l, p_u)$-LSH family (where $u = cl$) to solve
the $(l,c)$-Near Neighbor problem in time $O(dn^{\rho}\log n)$ using
space $O(dn + n^{1+\rho}\log n)$, where $\rho = \log p_l / \log p_u$. Their
algorithm used an LSH family with $\rho = 1/c$. For the case of
Euclidean space, the exponent $1/c$ was improved to $\beta/c$ for
some fixed constant $\beta < 1$ by Datar et al. [16]. A near linear
storage space solution was proposed by Panigrahy [32],
which has a space requirement of $\tilde{O}(n)$ but a larger query
time of $O(n^{2.09/c})$, using entropy based techniques along with
the LSH family. Building on this work, Lv et al. [26]
suggested the use of multi-probe LSH methods to reduce the
number of hash tables required for solving the c-approximate
nearest neighbor problem. Andoni and Indyk [5] further
improved the value of $\rho$ (for Euclidean space) to $1/c^2 + o(1)$.
This value of $\rho$ is near-optimal since it matches the lower
bound for LSH proved by Motwani et al. [30].

Figure 5: Block diagram of the experimental setup. Queries are inserted in the IPv4 header and pumped into
the switch using an IXIA traffic generator. ACLs are programmed using the TLSH algorithm and are applied
to ingress packets. For an ACL match we forward the packet and drop it otherwise. Egress packets are collected
at another IXIA port and they correspond to the matched queries.
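To see how the exponent $\rho$ drives these costs, the following sketch evaluates $\rho = \log p_l / \log p_u$ and the query time and space expressions quoted above for the framework of Gionis et al., with constants dropped. The function names and the sample values are illustrative assumptions.

```python
import math


def rho(p_l, p_u):
    """LSH exponent rho = log(p_l) / log(p_u) for collision probabilities p_l > p_u."""
    return math.log(p_l) / math.log(p_u)


def lsh_costs(n, d, r):
    """Return (query_time, space) up to constants for exponent r.

    query_time ~ d * n**r * log n, space ~ d * n + n**(1 + r) * log n.
    """
    query_time = d * n ** r * math.log(n)
    space = d * n + n ** (1 + r) * math.log(n)
    return query_time, space


if __name__ == "__main__":
    n, d, c = 10**6, 64, 2.0
    for label, r in (("rho = 1/c", 1 / c), ("rho = 1/c^2", 1 / c**2)):
        print(label, lsh_costs(n, d, r))
```

Both quantities grow polynomially in $n$ for any fixed $\rho > 0$, and this polynomial growth is the cost that the TCAM-based approach aims to avoid.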

For $c \approx 1$, the near quadratic space requirement of the
optimal LSH could be a hindrance in solving large problems
like image similarity with millions of images in the
data set [31]. In fact, recent studies have shown that machine
learning techniques like restricted Boltzmann machines and
boosting outperform LSH when the number of bits available
is small and fixed [36, 40]. Also, the query time of
$O(dn^{1/c^2})$ makes the application of LSH for proximity based
methods like clustering and classification difficult in a streaming
environment. Hence, in this paper, we consider the
use of hardware primitives like TCAMs in order to formulate
fast, space efficient and accurate methods to solve similarity
search problems.

While TCAMs have been used previously in order to ob- 
tain efficient solutions to the problem of finding frequent 
elements in data streams [7], we are not aware of any other 
work which uses TCAMs for solving similarity search and 
nearest neighbor problems. 

In parallel, there has been significant progress in proving
lower bounds for the approximate nearest neighbor problem
using the cell probe model [12, 6, 33, 42]. In particular,
Panigrahy et al. [33] show that a data structure which solves the
c-ANNS problem using $t$ probes must use space $n^{1+\Omega(1/(ct))}$.
This implies that any data structure that uses $\tilde{O}(n)$ space
with poly-logarithmic word size, and with constant probability
gives a constant approximation to the nearest neighbor
problem, must be probed $\Omega(\log n/\log\log n)$ times. We note
that the use of hardware primitives like TCAMs, which
implement highly parallel operations (not conforming to the
cell probe model of computation), enables us to circumvent
these lower bounds.



7. CONCLUSION 

In this paper we have proposed a new method to solve 
the approximate nearest neighbor problem which yields an 
exponential improvement over existing methods. This im- 
provement is brought about by using a hashing scheme which 
does not conform to lower bounds for standard binary hash- 
ing schemes. This hashing scheme (TLSH) is supported by 
a TCAM. In fact using a TCAM of width poly-logarithmic 
in the size of the database, the approximate nearest neigh- 
bor problem can be solved in a single TCAM lookup. Using 
simulations we have shown that off the shelf TCAMs with 
width 288 bits can be used to solve similarity search prob- 
lems on various databases containing a million points in 64 
dimensional Euclidean space. We also design an experiment 
to demonstrate that even existing TCAMs within enterprise 
ethernet switches can perform 1.5M ANN queries per second per 1 Gbps
port. Thus, we believe that TCAM based similarity search 
might open new vistas in ultra high speed data mining and 
learning applications. 